Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
@claude solve the merge conflicts
Claude finished @functionstackx's task in 2m 0s — View job

Merge Conflicts Resolved

Resolved merge conflicts in 3 files:
- .github/configs/nvidia-master.yaml: keep the expanded search space from this PR
- benchmarks/single_node/minimaxm2.5_fp8_b200.sh: keep --kv-cache-dtype fp8 and --stream-interval 20, with --no-enable-prefix-caching from main
- perf-changelog.yaml: keep both main's new entries and this PR's entry

Pushed commit.

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
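The serving flags kept from each side of the benchmark-script conflict can be sketched as a single argument list. This is a hypothetical fragment: the surrounding `vllm serve` invocation and the `<model>` placeholder are illustrative assumptions, since the full command line is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical fragment: the three flags named in the conflict resolution,
# combined into one argument list. The full serve command is an assumption.
SERVE_ARGS=(
  --kv-cache-dtype fp8            # kept from this PR's branch
  --stream-interval 20            # kept from this PR's branch
  --no-enable-prefix-caching      # kept from main
)
echo "vllm serve <model> ${SERVE_ARGS[*]}"
```

Keeping the flags in one array like this makes future conflicts in the same script smaller, since each side of a merge touches individual array lines rather than one long command string.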
functionstackx left a comment
hi @faradawn @kedarpotdar-nv can you update & merge your existing recipe PR first before we merge this InferenceX PR? vllm-project/recipes#272
Usually we allow the InferenceX PR to merge first, since the recipe PR usually follows as a fast follow, but this time the recipes PR needs to merge first because the previous PR's recipe update still hasn't merged.
@claude remove isl: 1024 osl: 8192
Claude finished @functionstackx's task in 57s — View job

Remove ISL 1024 OSL 8192 Configuration

Removed the ISL 1024 / OSL 8192 seq-len config (with tp: 2 and tp: 4 search-space entries). Pushed commit.

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Validated run here: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23694341288
@kedarpotdar-nv can we not use a nightly image here? The MiniMax M2 arch has been around for multiple months now, and from looking at the master.yaml files, MiniMax is using release images everywhere else.
@Klaud-Cold please remove export VLLM_USE_DEEP_GEMM=0 from the MiniMax B200 FP8 vLLM script.
@kedarpotdar-nv it is @claude
@functionstackx this PR is ready for review. We are using the release 0.19 container now.
@claude add pr description in accordance with perf changelog |
.github/configs/nvidia-master.yaml
Outdated

```yaml
- { tp: 2, ep: 2, conc-start: 4, conc-end: 256 }
- { tp: 4, conc-start: 4, conc-end: 512 }
- { tp: 4, ep: 4, conc-start: 16, conc-end: 64 }
# - { tp: 8, conc-start: 4, conc-end: 8 }
```
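One plausible reading of a search-space row (an assumption; the harness's actual sweep rule is not shown in this thread) is a concurrency sweep that doubles from conc-start up to conc-end:

```python
# Hypothetical interpretation of a search-space entry's conc-start/conc-end
# fields: a doubling concurrency sweep. The real benchmark harness may step
# differently; this only illustrates the shape of the search space.
def concurrency_sweep(conc_start: int, conc_end: int) -> list[int]:
    values = []
    c = conc_start
    while c <= conc_end:
        values.append(c)
        c *= 2  # double the concurrency at each step
    return values

print(concurrency_sweep(4, 256))  # → [4, 8, 16, 32, 64, 128, 256]
```

Under this reading, the `tp: 4` row (conc-start 4, conc-end 512) sweeps one more point than the `tp: 2, ep: 2` row, which is why widening conc-end expands the search space.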
Claude finished @cquil11's task in 39s — View job

Add PR Description
PR description has been updated with a summary of all changes from the perf-changelog and diffs: image upgrade (v0.17.0 → v0.19.0), search space expansions, new EP entries, removed seq-len config, and all benchmark script parameter changes. |
```diff
  export VLLM_FLASHINFER_ALLREDUCE_BACKEND=mnnvl

- if [ "$EP_SIZE" -ge 1 ]; then
+ if [ "$EP_SIZE" -gt 1 ]; then
```
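The effect of switching the guard from -ge to -gt can be sketched as below. The flag appended inside the branch is an illustrative assumption about what the guard controls; only the comparison change itself is shown in the diff above.

```shell
#!/usr/bin/env bash
# Sketch of the changed guard: with -gt, EP_SIZE=1 no longer takes the
# expert-parallel branch, whereas -ge 1 would have taken it for any
# EP_SIZE >= 1. The flag appended here is an illustrative assumption.
EP_SIZE=${EP_SIZE:-1}
EP_ARGS=()
if [ "$EP_SIZE" -gt 1 ]; then
  EP_ARGS+=(--enable-expert-parallel)
fi
echo "EP args: ${EP_ARGS[*]:-none}"
```

With the old -ge 1 test, a single-expert-group run (EP_SIZE=1) would still have enabled the expert-parallel path, which the -gt 1 form avoids.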
perf-changelog.yaml
Outdated

```yaml
- config-keys:
    - minimaxm2.5-fp8-b200-vllm
  description:
    - "Update vLLM image from v0.17.0 to v0.18.0 for MiniMax-M2.5 FP8 B200"
```
update to 0.19.0 @claude
also change the PR description to reflect this
Claude finished @cquil11's task in 52s — View job

Update perf-changelog to 0.19.0 & PR Description

- Read perf-changelog.yaml to find the entry to update
- Update version from v0.18.0 → v0.19.0 in perf-changelog.yaml
- Update PR description to reflect v0.19.0
- Commit and push

Updated the perf-changelog entry (v0.18.0 → v0.19.0) and the PR description. Pushed commit 0d734c1 to nv/minimax-vllm018.
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

Summary

Upgrade the MiniMax-M2.5 FP8 B200 vLLM benchmark configuration from v0.17.0 to v0.19.0 with an expanded search space and tuned serving parameters.

Changes

Image Upgrade
- v0.17.0-cu130 to v0.19.0-cu130

Search Space Updates (nvidia-master.yaml)

Benchmark Script Updates (minimaxm2.5_fp8_b200.sh)
- VLLM_USE_FLASHINFER_MOE_FP8=0 and VLLM_MOE_USE_DEEP_GEMM=0 env vars
- VLLM_FLASHINFER_ALLREDUCE_BACKEND=mnnvl
- EP check: -gt 1 instead of -ge 1
- --kv-cache-dtype fp8
- --max-cudagraph-capture-size 2048
- --max-num-batched-tokens based on ISL
- --stream-interval 20
- --gpu-memory-utilization from 0.95 to 0.90

Validated Run
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23694341288